Long Story Short: a Summarize-then-Search Method for Long Video Question Answering
Large language models such as GPT-3 have demonstrated an impressive
capability to adapt to new tasks without requiring task-specific training data.
This capability has proven particularly effective in settings such as narrative
question answering, where the diversity of tasks is immense but supervision
data is scarce. In this work, we investigate whether such language models
can extend their zero-shot reasoning abilities to long multimodal narratives in
multimedia content such as drama, movies, and animation, where the story plays
an essential role. We propose Long Story Short, a framework for narrative video
QA that first summarizes the narrative of the video to a short plot and then
searches parts of the video relevant to the question. We also propose to
enhance visual matching with CLIPCheck. Our model outperforms state-of-the-art
supervised models by a large margin, highlighting the potential of zero-shot QA
for long videos.
Comment: Published in BMVC 2023
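As a rough illustration of the CLIPCheck idea described above, the hedged
sketch below scores candidate answers against sampled video frames with an
off-the-shelf CLIP model and prefers the answer with the most visual support;
the function name, frame-averaging scheme, and checkpoint are illustrative
assumptions, not the paper's implementation.

```python
# Hypothetical sketch of CLIP-based answer verification in the spirit of
# CLIPCheck: score each candidate answer against sampled video frames and
# prefer the answer with the strongest visual evidence.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_check(frames: list[Image.Image], candidates: list[str]) -> int:
    """Return the index of the candidate answer best supported by the frames."""
    inputs = processor(text=candidates, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (num_frames, num_candidates) similarity logits.
    probs = out.logits_per_image.softmax(dim=-1)
    # Average visual support for each candidate across the sampled frames.
    return int(probs.mean(dim=0).argmax())
```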
Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms
Commonsense norms are defeasible by context: reading books is usually great,
but not when driving a car. While contexts can be explicitly described in
language, in embodied scenarios, contexts are often provided visually. This
type of visually grounded reasoning about defeasible commonsense norms is
generally easy for humans, but (as we show) poses a challenge for machines, as
it necessitates both visual understanding and reasoning about commonsense
norms. We construct a new multimodal benchmark for studying visually grounded
commonsense norms: NORMLENS. NORMLENS consists of 10K human judgments
accompanied by free-form explanations covering 2K multimodal situations, and
serves as a probe to address two questions: (1) to what extent can models align
with average human judgment? and (2) how well can models explain their
predicted judgments? We find that state-of-the-art model judgments and
explanations are not well-aligned with human annotation. Additionally, we
present a new approach to better align models with humans by distilling social
commonsense knowledge from large language models. The data and code are
released at https://seungjuhan.me/normlens.
Comment: Published as a conference paper at EMNLP 2023 (long paper)
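As a hedged sketch of the first probe (alignment with average human judgment),
the snippet below scores a model by how often its prediction matches the
majority human label per situation; the field names and labels are assumptions
for illustration, not the NORMLENS release format.

```python
# Toy alignment probe: a model is counted correct on a situation when its
# judgment matches the majority label among that situation's human judgments.
from collections import Counter

def alignment_score(examples: list[dict]) -> float:
    """examples: [{"human_labels": [...], "model_label": ...}, ...] (assumed schema)."""
    hits = 0
    for ex in examples:
        majority = Counter(ex["human_labels"]).most_common(1)[0][0]
        hits += ex["model_label"] == majority
    return hits / len(examples)

print(alignment_score([
    {"human_labels": ["bad", "bad", "okay"], "model_label": "bad"},
    {"human_labels": ["okay", "okay", "okay"], "model_label": "bad"},
]))  # 0.5
```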
Transitional Adaptation of Pretrained Models for Visual Storytelling
© 2021 IEEE. Previous models for vision-to-language generation tasks usually pretrain a visual encoder and a language generator in their respective domains and jointly finetune them on the target task. However, this direct transfer practice may suffer from discord between visual specificity and language fluency, since the two modules are often trained separately on large corpora of visual and text data with no common ground. In this work, we claim that a transitional adaptation task is required between pretraining and finetuning to harmonize the visual encoder and the language model for challenging downstream target tasks like visual storytelling. We propose a novel approach named Transitional Adaptation of Pretrained Model (TAPM) that adapts the multimodal modules to each other with a simpler alignment task that uses only visual inputs, without the need for text labels. Through extensive experiments, we show that the adaptation step significantly improves the performance of multiple language models for sequential video and image captioning tasks. We achieve new state-of-the-art performance on both language metrics and human evaluation in the multi-sentence description task of LSMDC 2019 [50] and the image storytelling task of VIST [18]. Our experiments reveal that this improvement in caption quality does not depend on the specific choice of language models.
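A minimal, hypothetical sketch of a text-free alignment objective in the
spirit of TAPM: projected visual features of consecutive clips are pulled
together with an InfoNCE-style contrastive loss, using visual inputs only.
The projector shape, neighbor target, and temperature are illustrative
assumptions rather than the paper's exact objective.

```python
# Hedged sketch: align each clip's projected visual feature with its temporal
# neighbor so the visual projector and the language model's input space are
# harmonized before finetuning on captions, with no text labels involved.
import torch
import torch.nn.functional as F

def neighbor_alignment_loss(visual_feats: torch.Tensor,
                            projector: torch.nn.Module,
                            temperature: float = 0.07) -> torch.Tensor:
    """visual_feats: (num_clips, feat_dim) features of consecutive video clips."""
    z = F.normalize(projector(visual_feats), dim=-1)  # (N, d)
    logits = z[:-1] @ z[1:].T / temperature           # clip i vs. clip i+1
    targets = torch.arange(logits.size(0))            # true neighbor on diagonal
    return F.cross_entropy(logits, targets)

projector = torch.nn.Linear(2048, 768)  # assumed feature/embedding sizes
loss = neighbor_alignment_loss(torch.randn(8, 2048), projector)
loss.backward()
```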
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
© 2021 IEEE. The natural association between visual observations and their corresponding sound provides powerful self-supervisory signals for learning video representations, which makes the ever-growing amount of online video an attractive source of training data. However, large portions of online videos contain irrelevant audio-visual signals because of edited/overdubbed audio, and models trained on such uncurated videos have been shown to learn suboptimal representations. Therefore, existing self-supervised approaches rely on datasets with predetermined taxonomies of semantic concepts, where there is a high chance of audio-visual correspondence. Unfortunately, constructing such datasets requires labor-intensive manual annotation and/or verification, which severely limits the utility of online videos for large-scale learning. In this work, we present an automatic dataset curation approach based on subset optimization, where the objective is to maximize the mutual information between the audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve performance competitive with models trained on existing manually curated datasets. The most significant benefit of our approach is scalability: we release ACAV100M, a dataset of 100 million videos with high audio-visual correspondence, ideal for self-supervised video representation learning.
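A toy sketch of the subset-optimization objective: assuming each clip has
discrete audio and visual cluster assignments (e.g., from per-modality
k-means), a greedy pass grows the subset that maximizes mutual information
between the two cluster labelings. The clustering setup and naive greedy
strategy here are simplifying assumptions, not the released pipeline.

```python
# Greedily select clips whose audio-cluster and visual-cluster assignments
# have maximal mutual information (a discrete MI proxy over the subset).
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_curation(audio_clusters: np.ndarray,
                    visual_clusters: np.ndarray,
                    budget: int) -> list[int]:
    """Pick `budget` clip indices maximizing audio/visual cluster MI."""
    selected: list[int] = []
    remaining = set(range(len(audio_clusters)))
    for _ in range(budget):
        best, best_mi = None, -1.0
        for i in remaining:
            cand = selected + [i]
            mi = mutual_info_score(audio_clusters[cand], visual_clusters[cand])
            if mi > best_mi:
                best, best_mi = i, mi
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage on synthetic cluster labels (8 clusters, 100 clips, keep 10):
subset = greedy_curation(np.random.randint(0, 8, 100),
                         np.random.randint(0, 8, 100), budget=10)
```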
Multimodal Knowledge Alignment with Reinforcement Learning
Large language models readily adapt to novel settings, even without
task-specific training data. Can their zero-shot capacity be extended to
multimodal inputs? In this work, we propose ESPER which extends language-only
zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to
language model generations without direct supervision: for example, in the
image case our reward optimization relies only on cosine similarity derived
from CLIP, and thus requires no additional explicitly paired (image, caption)
data. Because the parameters of the language model are left unchanged, the
model maintains its capacity for zero-shot generalization. Experiments
demonstrate that ESPER outperforms baselines and prior work on a variety of
zero-shot tasks; these include a new benchmark we collect and release, the ESP
dataset, which tasks models with generating several diversely styled captions
for each image.
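A minimal sketch of the reward described above, assuming an off-the-shelf
CLIP checkpoint: the reward for a sampled caption is its CLIP cosine
similarity to the image, so no paired (image, caption) data is needed. The
surrounding policy-gradient update on the language model is omitted.

```python
# CLIP-similarity reward: higher image-caption similarity yields a larger
# reward for the reinforcement-learning update of the caption policy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP embeddings of the image and caption."""
    inputs = processor(text=[caption], images=[image],
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    return torch.cosine_similarity(img, txt).item()
```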
SARS-CoV-2 hijacks neutralizing dimeric IgA for nasal infection and injury in Syrian hamsters
Prevention of robust severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) infection in the nasal turbinate (NT) requires in vivo evaluation of IgA neutralizing antibodies. Here, we report the efficacy of receptor binding domain (RBD)-specific monomeric B8-mIgA1 and B8-mIgA2, and dimeric B8-dIgA1, B8-dIgA2 and TH335-dIgA1 against intranasal SARS-CoV-2 challenge in Syrian hamsters. These antibodies exhibited comparable neutralization potency against authentic virus by competing with the human angiotensin-converting enzyme-2 (ACE2) receptor for RBD binding. While significantly reducing viral loads in the lungs, prophylactic intranasal B8-dIgA unexpectedly led to high amounts of infectious virus and extensive damage in the NT compared with controls. Mechanistically, B8-dIgA failed to inhibit SARS-CoV-2 cell-to-cell transmission, but was hijacked by the virus through dendritic cell-mediated trans-infection of NT epithelia, leading to robust nasal infection. Cryo-EM further revealed B8 as a class II antibody binding trimeric RBDs in 3-up or 2-up/1-down conformation. Neutralizing dIgA may therefore engage an unexpected mode of SARS-CoV-2 nasal infection and injury.